1. Introduction

Principal component analysis (PCA) is a popular tool in multivariate
statistics. However, PCA estimates may be highly influenced by certain types of
observations and, as such, it is often important to locate, and perhaps
subsequently treat, those observations which may be harmful in practice.
Influence analysis of PCA estimators (see, e.g., [9; 3; 31; 11] and [14]) shows
that the direction of an observation plays a crucial role in how influential it
may be on the eigenvector and eigenvalue estimates. As such, influential
observations may or may not be detectable using common distance measures and
can therefore be difficult to locate.

Throughout we will consider PCA of a symmetric matrix where the set of
all eigenvectors satisfies orthonormality conditions. Often it is the span of a
subset of eigenvectors that is of primary interest rather than individual elements.
Let $A = [a_1, \ldots, a_K]$ and $B = [b_1, \ldots, b_K]$ be two $p \times K$
matrices where $\|a_i\| = 1$, $\|b_i\| = 1$ $(i = 1, \ldots, K)$ and
$a_i^\top a_j = 0$, $b_i^\top b_j = 0$ for all $i \neq j$. Two common measures
for the comparison between the column spaces of $A$ and $B$ are the
$\mathrm{RV}(A, B)$ coefficient ([15; 27]) and the $\mathrm{GCD}(A, B)$ measure
[32]. Let $P_A = AA^\top$ and $P_B = BB^\top$ denote the projection matrices
onto the column spaces of $A$ and $B$ respectively. Then, due to orthonormality
of the columns of $A$ and $B$ (see, e.g., [3]) we have

$$\mathrm{RV}(A, B) = \mathrm{GCD}(A, B) = \frac{1}{K}\,\mathrm{trace}(P_A P_B). \tag{1}$$

In considering influence on eigenvector subset spans, Benasseni [3] noted that
the RV and GCD measures were insensitive to small perturbations. Benasseni
then introduced a new measure for assessing sensitivity given as

$$\rho_1(A, B) = 1 - \frac{1}{K}\sum_{k=1}^{K} \|a_k - P_B a_k\| \tag{2}$$

or, alternatively, $\rho_2(A, B) = \rho_1(B, A)$. It should be noted that it is
not necessarily true that $\rho_2(A, B) = \rho_1(A, B)$ so that, unlike the RV
and GCD measures, the ordering of the arguments may be important. However, by
considering a small adjustment to $\rho_1(A, B)$ that sums
$\|a_k - P_B a_k\|^2$ instead of $\|a_k - P_B a_k\|$, note that

$$1 - \frac{1}{K}\sum_{k=1}^{K} \|a_k - P_B a_k\|^2 = \mathrm{RV}(A, B) = \mathrm{GCD}(A, B)$$

so that there is a strong link between Benasseni's measure and the RV and GCD
measures. The purpose of this paper is then to consider an influence measure
based on the RV and GCD measures.
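The link between (1) and the squared-norm version of (2) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names (`orthonormal_basis`, `rv_gcd`, `benasseni`) and the random test matrices are illustrative assumptions.

```python
import numpy as np

def orthonormal_basis(p, K, rng):
    """Return a p x K matrix with orthonormal columns (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, K)))
    return Q

def rv_gcd(A, B):
    """trace(P_A P_B) / K for orthonormal A and B, as in equation (1)."""
    K = A.shape[1]
    P_A, P_B = A @ A.T, B @ B.T
    return np.trace(P_A @ P_B) / K

def benasseni(A, B, squared=False):
    """1 - (1/K) sum_k ||a_k - P_B a_k||; squared=True sums squared norms instead."""
    K = A.shape[1]
    resid = A - (B @ B.T) @ A              # column k holds a_k - P_B a_k
    norms = np.linalg.norm(resid, axis=0)
    return 1.0 - np.sum(norms**2 if squared else norms) / K

# Hypothetical example: two random 3-dimensional subspaces of R^6.
rng = np.random.default_rng(0)
A = orthonormal_basis(6, 3, rng)
B = orthonormal_basis(6, 3, rng)

print("RV = GCD            :", rv_gcd(A, B))
print("squared-norm version:", benasseni(A, B, squared=True))
print("rho_1(A,B), rho_1(B,A):", benasseni(A, B), benasseni(B, A))
```

The first two printed values agree up to rounding, while the two orderings of the unsquared measure need not.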

In Section 2 we consider perturbation of the RV and GCD measures and
introduce a measure for the influence analysis of eigenvector spans. We then
apply this measure to some example estimators in Section 3 to show how it
complements existing influence studies. Sample versions are discussed in
Section 4 for the detection of highly influential observations in practice. In
Section 5 we show how influential observations may be efficiently detected in
practice with respect to a high dimensional data set.

2. The effect of perturbation on the RV and GCD measures

Consider the contamination distribution

$$F_x(\varepsilon) = (1 - \varepsilon)F_{\mu,\Sigma} + \varepsilon\Delta_x \tag{3}$$

where $F_{\mu,\Sigma}$ is some distribution with mean $\mu$ and covariance
matrix $\Sigma$, $0 < \varepsilon < 1$ and $\Delta_x$ is the Dirac measure
putting all of its probability mass at the contaminant $x$. When convenient we
may also utilize $z = \Sigma^{-1/2}(x - \mu)$ (the standardized contaminant at
$F_{\mu,\Sigma}$) and the population Mahalanobis distance with respect to $x$
at $F_{\mu,\Sigma}$ given as

$$\mathrm{MD}_{\mu,\Sigma}(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)} = \|z\|.$$

Let $W$ denote a $p \times p$ statistical functional where, at an arbitrary
distribution $G$ for which it exists, $W(G)$ is symmetric. With respect to
$F_x(\varepsilon)$ defined in (3), we are interested in perturbation of the form

$$W(F_x(\varepsilon)) = W(F_{\mu,\Sigma}) + \varepsilon W_1 + \varepsilon^2 W_2 + O(\varepsilon^3) \tag{4}$$

where $W_1$ and $W_2$ are $p \times p$ symmetric matrices independent of $\varepsilon$.

Under perturbation of the form given in (4), the first order coefficient of
$\varepsilon$ is the influence function ([16; 17]) for $W$ at $F_{\mu,\Sigma}$,
denoted $\mathrm{IF}(W, F_{\mu,\Sigma}; x)$. The influence function is a useful
tool for understanding the sensitivity of estimators to small perturbations.
For example, if $\mathrm{IF}(W, F_{\mu,\Sigma}; x) = 0$ then, for small
$\varepsilon$, $W(F_x(\varepsilon)) \approx W(F_{\mu,\Sigma})$ so that small
perturbations with respect to $x$ have little influence on the estimator. On
the other hand, if $\mathrm{IF}(W, F_{\mu,\Sigma}; x)$ is large then
perturbation with respect to $x$ is highly influential since it effects a large
change on the estimator.

Let $|\kappa_1| \geq |\kappa_2| \geq \cdots \geq |\kappa_p|$ be the eigenvalues
of $W(F_{\mu,\Sigma})$ and let $v_1, \ldots, v_p$ denote the corresponding
eigenvectors. We are interested in the effect that perturbation has on the span
of a subset of the eigenvectors. Let $S \subset \{1, \ldots, p\}$ and let
$P_S = \sum_{j \in S} v_j v_j^\top$ denote the projection matrix onto the
subspace spanned by the elements of $\{v_j : j \in S\}$. Similarly, let
$P_S(\varepsilon) = \sum_{j \in S} v_j(\varepsilon) v_j(\varepsilon)^\top$
denote the perturbed equivalent at $F_x(\varepsilon)$ where
$v_1(\varepsilon), \ldots, v_p(\varepsilon)$ are the eigenvectors corresponding
to the ordered absolute eigenvalues of $W(F_x(\varepsilon))$. Typically, $S$
will be chosen to be $\{1, \ldots, K\}$ such that the span of the eigenvectors
corresponding to the $K$ largest absolute eigenvalues is of interest. For
example, in PCA corresponding to covariance matrices, principal components with
corresponding large eigenvalues are retained as they can account for most of
the total population variance. Let $S'$ denote the complement of $S$. The
following condition will also be used.

Condition 1. For each $j \in S$ and $r \in S'$, $\kappa_j \neq \kappa_r$.

We now look at a measure based on the expansion of a function of the RV 
and GCD measures based on the above condition. The proof is given in the 
Appendix. 



L.A. Prendergast/Sensitivity of PCA subspaces 






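Theorem 1. Consider $S$, $P_S$, $P_S(\varepsilon)$, $\kappa_i$ $(i = 1, \ldots, p)$
and $v_i$ $(i = 1, \ldots, p)$ defined previously and let

$$\rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x) = \frac{1}{\varepsilon^2}\left[1 - \frac{1}{K}\,\mathrm{trace}\{P_S P_S(\varepsilon)\}\right].$$

Then, under the perturbation form given in (4) and Condition 1,

$$\rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x) = \frac{1}{K}\sum_{j \in S}\sum_{r \in S'} \frac{\left[v_j^\top\, \mathrm{IF}(W, F_{\mu,\Sigma}; x)\, v_r\right]^2}{(\kappa_j - \kappa_r)^2} + O(\varepsilon)$$

where $K$ is the number of elements in the set $S$ and
$\mathrm{IF}(W, F_{\mu,\Sigma}; x)$ is the influence function for $W$ at
$F_{\mu,\Sigma}$.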

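From Theorem 1, write
$\rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x) = \varepsilon^{-2}[1 - \mathrm{trace}\{P_S P_S(\varepsilon)\}/K]$.
If perturbation has resulted in no difference between the span of the
non-perturbed and perturbed eigenvectors then
$\rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x) = 0$ since
$\mathrm{trace}\{P_S P_S(\varepsilon)\} = K$. However, as the distance between
the spans increases, that is, as

$$\frac{1}{K}\,\mathrm{trace}\{P_S P_S(\varepsilon)\}$$

(the RV and GCD measure for the comparison between the spans) approaches zero,
$\rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x)$ increases. Also, for small
$\varepsilon$ we have that $\rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x)$ is
approximately equal to the second order coefficient of $\varepsilon$ in the
expansion of $[1 - \mathrm{trace}\{P_S P_S(\varepsilon)\}/K]$. Hence, the
following influence measure will be utilized throughout,

$$\rho_S(W, F_{\mu,\Sigma}; x) = \lim_{\varepsilon \to 0} \rho_{S,\varepsilon}(W, F_{\mu,\Sigma}; x) = \frac{1}{K}\sum_{j \in S}\sum_{r \in S'} \frac{\left[v_j^\top\, \mathrm{IF}(W, F_{\mu,\Sigma}; x)\, v_r\right]^2}{(\kappa_j - \kappa_r)^2}. \tag{5}$$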

In the case of $K = 1$ such that $S = \{j\}$ for some $j \in \{1, \ldots, p\}$,

$$\rho_S(W, F_{\mu,\Sigma}; x) = \|\mathrm{IF}(v_j, F_{\mu,\Sigma}; x)\|^2$$

where $v_j$ is the functional for the $j$th eigenvector estimator and
$\mathrm{IF}(v_j, F_{\mu,\Sigma}; x)$ is the associated influence function.
Hence, the measure may be used to assess influence on individual components as
well as the span of a subset of components.

The $\rho_S(W, F_{\mu,\Sigma}; x)$ measure provides a convenient means to
understand the sensitivity of eigenvector estimators in the presence of
non-unique eigenvalues. $\mathrm{IF}(v_j, F_{\mu,\Sigma}; x)$ is only known for
the case of unique $\kappa_j$ (see, e.g., [9] or [11]) and, as such, is not
useful in all situations. For example, let $J \subset \{1, \ldots, p\}$ be such
that $\kappa_i = \kappa_j$ for all $i, j \in J$, $i \neq j$; then
$\mathrm{IF}(v_j, F_{\mu,\Sigma}; x)$ is not known for $j \in J$. However, from
(5), $\rho_S(W, F_{\mu,\Sigma}; x)$ is known provided $J \subset S$ or
$J \subset S'$ and Condition 1 holds.

3. Example estimators 

3.1. Covariance and correlation matrix estimators

Let $C_0$ denote the functional for the classical covariance matrix estimator
where, at $F_{\mu,\Sigma}$, $C_0(F_{\mu,\Sigma}) = \Sigma$ with eigenvalues
$\lambda_1 \geq \cdots \geq \lambda_p$ and corresponding eigenvectors
$\eta_1, \ldots, \eta_p$. The influence function for this estimator (see, e.g.,
[9]) is $\mathrm{IF}(C_0, F_{\mu,\Sigma}; x) = (x - \mu)(x - \mu)^\top - \Sigma$
so that, from (5),

$$\rho_S(C_0, F_{\mu,\Sigma}; x) = \frac{1}{K}\sum_{j \in S}\sum_{r \in S'} \frac{y_j^2 y_r^2}{(\lambda_j - \lambda_r)^2} \tag{6}$$

where $y_j = \eta_j^\top (x - \mu)$.
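As a numerical illustration of how (6) arises as the small-contamination limit in Theorem 1, the sketch below evaluates $\varepsilon^{-2}[1 - \mathrm{trace}\{P_S P_S(\varepsilon)\}/K]$ for the covariance functional and compares it with (6). It relies on the standard fact that the covariance of the mixture $(1 - \varepsilon)F_{\mu,\Sigma} + \varepsilon\Delta_x$ is $(1 - \varepsilon)\Sigma + \varepsilon(1 - \varepsilon)(x - \mu)(x - \mu)^\top$. The function names and the particular $\Sigma$, $x$ and $\varepsilon$ are assumptions chosen for illustration only.

```python
import numpy as np

def top_projection(W, K):
    """Projection onto the span of the K eigenvectors with largest |eigenvalue|."""
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(-np.abs(vals))
    V = vecs[:, order[:K]]
    return V @ V.T

def rho_eps_cov(Sigma, mu, x, K, eps):
    """(1/eps^2) * [1 - trace(P_S P_S(eps)) / K] for the covariance functional."""
    d = x - mu
    # Covariance of the mixture (1 - eps) F + eps * Dirac(x).
    W_eps = (1 - eps) * Sigma + eps * (1 - eps) * np.outer(d, d)
    P, P_eps = top_projection(Sigma, K), top_projection(W_eps, K)
    return (1.0 - np.trace(P @ P_eps) / K) / eps**2

def rho_cov(Sigma, mu, x, K):
    """Equation (6): (1/K) sum_{j in S} sum_{r in S'} y_j^2 y_r^2 / (lam_j - lam_r)^2."""
    vals, vecs = np.linalg.eigh(Sigma)
    order = np.argsort(-vals)
    lam, eta = vals[order], vecs[:, order]
    y = eta.T @ (x - mu)
    num = np.outer(y[:K] ** 2, y[K:] ** 2)
    den = (lam[:K, None] - lam[None, K:]) ** 2
    return np.sum(num / den) / K

# Hypothetical setting: p = 3, S = {1, 2}.
Sigma = np.diag([4.0, 2.0, 1.0])
mu = np.zeros(3)
x = np.array([1.0, 1.0, 1.0])
approx = rho_eps_cov(Sigma, mu, x, K=2, eps=1e-4)
limit = rho_cov(Sigma, mu, x, K=2)
print(approx, limit)
```

For this choice, (6) reduces to $(1/2)(1/9 + 1) = 5/9$, and the finite-$\varepsilon$ value differs from it only by a term of order $\varepsilon$.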

Remark 1. Benasseni's coefficient, as shown in (2), was introduced in the
classical covariance matrix setting. This measure was based on the average sine
of the angle between each of the perturbed eigenvectors and their projection onto
the non-perturbed subspace. Using our notation the influence measure associated
with this coefficient is (see [3])

$$\frac{1}{K}\sum_{j \in S}\left\{\sum_{r \in S'} \frac{y_j^2 y_r^2}{(\lambda_j - \lambda_r)^2}\right\}^{1/2}$$

which contains similar sensitivity information to the
$\rho_S(C_0, F_{\mu,\Sigma}; x)$ measure.

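Similarly, let $R_0$ denote the functional for the classical correlation matrix
estimator where, at $F_{\mu,\Sigma}$, $R_0(F_{\mu,\Sigma}) = \Gamma$ and let
$\alpha_1, \ldots, \alpha_p$ and $\gamma_1, \ldots, \gamma_p$ denote the
eigenvalues and associated eigenvectors of $\Gamma$. For $x_i$ denoting the
$i$th element of $x$, $\mu_i$ denoting the $i$th element of $\mu$ and
$\sigma_{ii}$ denoting the $i$th diagonal element of $\Sigma$, let
$z^\top = [z_1, \ldots, z_p] = [(x_1 - \mu_1)/\sigma_{11}, \ldots, (x_p - \mu_p)/\sigma_{pp}]$
and $D = \mathrm{diag}(z_1^2, \ldots, z_p^2)$. The influence function for this
estimator is, from [13],
$\mathrm{IF}(R_0, F_{\mu,\Sigma}; x) = zz^\top - (D\Gamma + \Gamma D)/2$ so
that, from (5),

$$\rho_S(R_0, F_{\mu,\Sigma}; x) = \frac{1}{K}\sum_{j \in S}\sum_{r \in S'} \frac{1}{(\alpha_j - \alpha_r)^2}\left[\gamma_j^\top zz^\top \gamma_r - \frac{1}{2}(\alpha_j + \alpha_r)\,\gamma_j^\top D \gamma_r\right]^2. \tag{7}$$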

We will now provide an example of the form of the influence measure for
eigenvector subspaces of covariance estimators. We will not only consider the
measure for the classical case as shown in (6), but also with respect to two
robust estimators of the covariance matrix; namely the one-step re-weighted
Minimum Covariance Determinant (RMCD) estimator, which includes an initial
MCD estimation step [28] followed by a subsequent re-weighting [21], and the
S-estimator ([29; 30; 12]). For simplicity and to satisfy Fisher consistency at
the non-contaminated model we will assume $F_{\mu,\Sigma}$ is multivariate
normal. For the RMCD estimator we choose the breakdown point for the initial
MCD estimator to be $\alpha = 0.5$ followed by the retention of 97.5% mass that
satisfies $\mathrm{MD}^2_{\mu^*,\Sigma^*}(x) \leq q_{0.975}$ where
$P(\chi^2_p \leq q_{0.975}) = 0.975$ and $\mu^*$ and $\Sigma^*$ are the MCD
mean vector and covariance matrix. For the S-estimator we use the minimizer
function associated with Tukey's biweight function (see, e.g., Example 2.2 of
[20]) and $\alpha = 0.5$. Associated influence functions for these estimators
that are used in the following example can be found in [10] and [20].

Fig 1. Subspace sensitivity plots for Example 1 with K = 3 for (a) the classical estimator, (b) the
RMCD estimator with $\alpha = 0.5$ and $\delta = 0.025$ and (c) the S-estimator.

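Example 1. Let $\lambda_1 = \cdots = \lambda_K$ and
$\lambda_{K+1} = \cdots = \lambda_p$ with $\lambda_p < \lambda_1$,
$\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, $\mu = 0$ and
$S = \{1, \ldots, K\}$. From (6), and since each
$y_i = \lambda_i^{1/2}\eta_i^\top z$ for $z = \Sigma^{-1/2}x$, then
$\rho_S(C_0, F_{\mu,\Sigma}; x)$ is equal to

$$\lambda_1\lambda_p\left\{K(\lambda_1 - \lambda_p)^2\right\}^{-1}\mathrm{trace}(P_S zz^\top)\left\{\mathrm{MD}^2_{\mu,\Sigma}(x) - \mathrm{trace}(P_S zz^\top)\right\}.$$

Let $\theta$ denote the angle between $z$ and $P_S z$ (its projection onto the
$K$-eigenvector subspace). Then
$\cos(\theta) = \mathrm{trace}(P_S zz^\top)/\{\mathrm{MD}_{\mu,\Sigma}(x)\|P_S z\|\}$,
which gives $\mathrm{trace}(P_S zz^\top) = \mathrm{MD}^2_{\mu,\Sigma}(x)\cos^2(\theta)$
(since $\|P_S z\| = \sqrt{\mathrm{trace}(P_S zz^\top)}$), so that

$$\rho_S(C_0, F_{\mu,\Sigma}; x) = \frac{\lambda_1\lambda_p\,\mathrm{MD}^4_{\mu,\Sigma}(x)}{K(\lambda_1 - \lambda_p)^2}\cos^2(\theta)\left\{1 - \cos^2(\theta)\right\}.$$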
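The closed form in Example 1 can be compared directly with the double sum in (6). The sketch below does so for a diagonal $\Sigma$ with two distinct eigenvalues, where the eigenvectors are the coordinate axes; the function names and numerical values are illustrative assumptions.

```python
import numpy as np

def rho_general(lam, x, K):
    """Equation (6) with Sigma = diag(lam) and mu = 0, so eta_j = e_j and y_j = x_j."""
    y2 = x ** 2
    gap2 = (lam[:K, None] - lam[None, K:]) ** 2
    return np.sum(np.outer(y2[:K], y2[K:]) / gap2) / K

def rho_example1(lam1, lamp, x, K):
    """Closed form of Example 1 in terms of the Mahalanobis distance and the angle theta."""
    p = x.size
    lam = np.concatenate([np.full(K, lam1), np.full(p - K, lamp)])
    z = x / np.sqrt(lam)                    # z = Sigma^{-1/2} x
    md2 = z @ z                             # squared Mahalanobis distance
    cos2 = np.sum(z[:K] ** 2) / md2         # cos^2(theta) = ||P_S z||^2 / ||z||^2
    return lam1 * lamp / (K * (lam1 - lamp) ** 2) * md2 ** 2 * cos2 * (1.0 - cos2)

# Hypothetical values: p = 5, K = 3.
lam1, lamp, K = 5.0, 1.0, 3
x = np.array([1.0, -2.0, 0.5, 3.0, 1.5])
lam = np.array([lam1] * K + [lamp] * (x.size - K))
print(rho_general(lam, x, K), rho_example1(lam1, lamp, x, K))

# A contaminant lying entirely in the retained subspace has zero influence.
x_in = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
print(rho_example1(lam1, lamp, x_in, K))
```

The product $\cos^2(\theta)\{1 - \cos^2(\theta)\}$ is largest at $\theta = 45°$ and vanishes at $0°$ and $90°$, which is why direction matters as much as distance.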

In Figure 1 we plot $\rho_S(C_0, F_{\mu,\Sigma}; x)$ for Example 1 and the
corresponding measures for the RMCD and S-estimators as described above. As can
be seen in plot (a), the classical estimator can be highly influenced by $x$,
in particular those $x$ with a large $\mathrm{MD}_{\mu,\Sigma}(x)$. However,
the angle of $x$ from its projection onto the subspace spanned by
$[\eta_1, \ldots, \eta_K]$ also plays an important role. Regardless of the
magnitude of $\mathrm{MD}_{\mu,\Sigma}(x)$, $x$ has zero influence when
$z \in \mathrm{Span}\{\eta_j : j \in S\}$ or
$z \in \mathrm{Span}\{\eta_r : r \in S'\}$. This is also the case for the RMCD
and S-estimators, though the downweighting of observations with a large
$\mathrm{MD}_{\mu,\Sigma}(x)$ results in reduced influence as is seen in plots
(b) and (c). Let $q_\xi$ be the $\xi \times 100\%$ percentile for the
$\chi^2_p$ distribution such that $P(\chi^2_p \leq q_\xi) = \xi$. For the RMCD
estimator there are discontinuities at $\mathrm{MD}^2_{\mu,\Sigma}(x) = q_\alpha$
and $\mathrm{MD}^2_{\mu,\Sigma}(x) = q_\delta$ corresponding to the rejection
of observations in the initial weighting and re-weighting steps respectively.
The estimator is particularly sensitive at these points since small changes in
$x$ can effect a large change in influence. From an influence perspective the
S-estimator is preferred since it does not suffer from comparatively high
influence and the smooth weighting function utilized results in smooth changes
in influence with respect to changes in $\mathrm{MD}_{\mu,\Sigma}(x)$.

3.2. Dimension reduction methods

In the regression setting, consider a univariate response variable $Y$ and
$p$-dimensional predictor vector $X$. If there exists a $p \times K$ matrix $B$
such that $Y \perp\!\!\!\perp X \mid B^\top X$ then, when $K < p$, dimension
reduction can be achieved without loss of information by replacing the
$p$-dimensional $X$ with the $K$-dimensional $B^\top X$. For more information
see, e.g., [7]. Let $\mathcal{S}$ denote the column space of $B$. Under
appropriate conditions, methods such as Sliced Inverse Regression (SIR, [18]),
Sliced Average Variance Estimates (SAVE, [6]) and Principal Hessian Directions
(PHD, [19]) seek a basis for $\mathcal{S}$. The methods are based on an
eigen-analysis of a $p \times p$ symmetric matrix so that, provided the
influence function for the symmetric matrix estimator is known, the influence
measure resulting from Theorem 1 is applicable in this setting.

Influence functions for versions of SIR, SAVE and PHD that return an
orthonormal basis for the column space of $B$ have been considered, where
influence on the directions of the basis is assessed with respect to
Benasseni's measure (see [23; 24] and [25]). As an example we will consider PHD
since the method requires less introduction. Assuming $X \sim N_p(\mu, \Sigma)$,
[19] showed that eigenvectors corresponding to non-zero eigenvalues of the
average Hessian matrix given as

$$H_x = \Sigma^{-1} E\left[\{Y - E(Y)\}\{X - \mu\}\{X - \mu\}^\top\right]\Sigma^{-1}$$

are elements of $\mathcal{S}$. For more on PHD, including alternative versions,
see [19] or [8].

In this regression context, the contamination distribution becomes

$$F_x(\varepsilon) = (1 - \varepsilon)F_{\mu,\Sigma} + \varepsilon\Delta_{y,x}$$

where $\Delta_{y,x}$ has all of its probability mass at
$(y, x) \in \mathbb{R}^{p+1}$ to allow for contamination in both the response
and predictor spaces. We shall assume that $F_{\mu,\Sigma} = N_p(\mu, \Sigma)$
and $\mathrm{rank}(H_x) = K$ so that PHD is capable of finding a complete basis
for $\mathcal{S}$. Let $H$ denote the functional for the usual estimator of
$H_x$. From [25] the influence function for $H$ at $F_{\mu,\Sigma}$ is

$$\mathrm{IF}(H, F_{\mu,\Sigma}; y, x) = \{y - E(Y)\}\{ww^\top - \Sigma^{-1}\} - w\{w^\top \Sigma H_x + b_{OLS}^\top\} - \{H_x \Sigma w + b_{OLS}\}w^\top - H_x \tag{8}$$

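where $w = \Sigma^{-1}(x - \mu)$, $b_{OLS}$ is the OLS slope vector at
$F_{\mu,\Sigma}$ and $E(Y)$ is the expected value of $Y$ at $F_{\mu,\Sigma}$.
We are interested in the basis estimator for $\mathcal{S}$ so we set
$S = \{1, \ldots, K\}$ where each $\lambda_i$ $(i \in S)$ is a non-zero
eigenvalue with corresponding eigenvector $\eta_i \in \mathcal{S}$. We also
have (see, e.g., [4; 5] or [22]) $b_{OLS} \in \mathcal{S}$ so that, and since
$H_x \eta_r = 0$ $(r \in S')$, from (5),

$$\rho_S(H, F_{\mu,\Sigma}; y, x) = \frac{1}{K}\sum_{j \in S}\frac{1}{\lambda_j^2}\sum_{r \in S'}\left[\{y - E(Y)\}\left(w_j w_r - \eta_j^\top \Sigma^{-1}\eta_r\right) - \left(\lambda_j \eta_j^\top \Sigma w + \eta_j^\top b_{OLS}\right)w_r\right]^2 \tag{9}$$

where $w_m = \eta_m^\top w$.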
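The step from (8) to (9) can be made explicit term by term. The following worked derivation is a sketch under the stated assumptions: $j \in S$, $r \in S'$, $H_x$ symmetric with $H_x\eta_j = \lambda_j\eta_j$ and $H_x\eta_r = 0$, and $b_{OLS}^\top \eta_r = 0$ because the eigenvectors for the zero eigenvalues are orthogonal to those for the non-zero eigenvalues, which span $\mathcal{S}$ when $\mathrm{rank}(H_x) = K$.

```latex
\begin{align*}
\eta_j^\top \{y - E(Y)\}\{ww^\top - \Sigma^{-1}\}\,\eta_r
  &= \{y - E(Y)\}\left(w_j w_r - \eta_j^\top \Sigma^{-1} \eta_r\right), \\
\eta_j^\top w \{w^\top \Sigma H_x + b_{OLS}^\top\}\,\eta_r
  &= w_j \left(w^\top \Sigma H_x \eta_r + b_{OLS}^\top \eta_r\right) = 0, \\
\eta_j^\top \{H_x \Sigma w + b_{OLS}\}\, w^\top \eta_r
  &= \left(\lambda_j \eta_j^\top \Sigma w + \eta_j^\top b_{OLS}\right) w_r, \\
\eta_j^\top H_x \eta_r &= 0 .
\end{align*}
% Combining the four terms with the signs in (8):
\[
\eta_j^\top \mathrm{IF}(H, F_{\mu,\Sigma}; y, x)\, \eta_r
  = \{y - E(Y)\}\left(w_j w_r - \eta_j^\top \Sigma^{-1} \eta_r\right)
    - \left(\lambda_j \eta_j^\top \Sigma w + \eta_j^\top b_{OLS}\right) w_r ,
\]
% and the eigenvalue gap in (5) is \kappa_j - \kappa_r = \lambda_j - 0 = \lambda_j.
```

Squaring, dividing by $\lambda_j^2$ and averaging over $j \in S$ gives the double sum in (9).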

The unboundedness of $\rho_S(H, F_{\mu,\Sigma}; y, x)$ is clearly evident,
suggesting that some observational types may be highly destructive to the
estimator. In the original PHD paper by Li [19], it was noted that the method
can be highly sensitive to outlying observations in the response space. The
model itself imposes distributional restrictions on the predictor (i.e.
normality) meaning that outliers in the predictor space may be more formally
identified and subsequently treated. This is not the case with the response,
where outlying observations may still contain important regression information.
Nevertheless, $\rho_S(H, F_{\mu,\Sigma}; y, x)$ increases without bound as $y$
is moved further from $E(Y)$, suggesting that such observational types can be
highly influential.

It is also interesting to note the types of observations that are not influential.
For example, suppose that $\mu = 0$ and $\Sigma = I_p$. Then, from (9), we have
that $\rho_S(H, F_{\mu,\Sigma}; y, x) = 0$ if $x$ is an element of
$\mathcal{S}$. That is, even extreme outliers may not be influential.
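Both observations can be illustrated by evaluating (9) directly. The sketch below is an assumption-laden toy setting, not the author's code: the eigenvectors of $H_x$ are taken to be the coordinate axes, the eigenvalues `lam`, the vector `b_ols` (placed in the span of the first $K$ eigenvectors) and the contaminants are hypothetical values, and $\mu = 0$, $\Sigma = I_p$ and $E(Y) = 0$ are imposed for simplicity.

```python
import numpy as np

def rho_phd(y, x, EY, mu, Sigma, eta, lam, b_ols, K):
    """Equation (9). Columns of eta are orthonormal eigenvectors of H_x; the first K
    have non-zero eigenvalues lam[:K]; b_ols lies in the span of eta[:, :K]."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (x - mu)
    wm = eta.T @ w                           # w_m = eta_m^T w
    total = 0.0
    for j in range(K):
        for r in range(K, eta.shape[1]):
            a = (y - EY) * (wm[j] * wm[r] - eta[:, j] @ Sigma_inv @ eta[:, r])
            b = (lam[j] * eta[:, j] @ Sigma @ w + eta[:, j] @ b_ols) * wm[r]
            total += (a - b) ** 2 / lam[j] ** 2
    return total / K

# Hypothetical setting with p = 4 and K = 2.
p, K = 4, 2
eta = np.eye(p)
lam = np.array([3.0, -1.5])                  # non-zero eigenvalues of H_x
b_ols = np.array([0.5, -1.0, 0.0, 0.0])      # element of the span of eta[:, :K]
mu, Sigma, EY = np.zeros(p), np.eye(p), 0.0

x_in = np.array([20.0, -30.0, 0.0, 0.0])     # extreme, but inside the subspace
x_out = np.array([1.0, 1.0, 1.0, 1.0])       # components inside and outside

rho_in = rho_phd(50.0, x_in, EY, mu, Sigma, eta, lam, b_ols, K)
rho_small_y = rho_phd(1.0, x_out, EY, mu, Sigma, eta, lam, b_ols, K)
rho_large_y = rho_phd(100.0, x_out, EY, mu, Sigma, eta, lam, b_ols, K)
print(rho_in, rho_small_y, rho_large_y)
```

In this setting the contaminant inside the subspace has zero influence however extreme it is, while moving $y$ away from $E(Y)$ for a contaminant with components on both sides inflates the measure quadratically.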

4. Sample versions 

In practice it is common to consider the effect of highly influential observa- 
tions on sample estimates. A limitation, however, is in the difficulty and inef- 
ficiency of locating such observations in large data sets. In this section we will 
consider sample based versions of the influence measure for the detection of 
influential observations. 

Let $x_1, \ldots, x_n$ denote a random sample of size $n$ where each
$x_i \in \mathbb{R}^p$ and let $F_n$ denote the empirical distribution of this
data. Similarly, suppose that $F_{n,(i)}$ denotes the empirical distribution of
the data without the $i$th observation so that

$$F_{n,(i)} = \left(1 + \frac{1}{n - 1}\right)F_n - \frac{1}{n - 1}\Delta_{x_i}.$$

Throughout, all reference will be to data of this form though in the regression
setting one would need to consider observational pairs consisting of a
predictor and response. Sample-based versions of the influence function for the
$i$th observation have been employed (see, e.g., [9]) where the contaminant is
$x_i$ and $\varepsilon$ is replaced with $-1/(n - 1)$. For $\hat{P}_S$ and
$\hat{P}_{S,(i)}$ denoting the projection matrix estimates with and without the
$i$th observation, the true sample version of $\rho_S(W, F_{\mu,\Sigma}; x)$ is
then

$$r_S(W, F_n; x_i) = (n - 1)^2\left[1 - \frac{1}{K}\,\mathrm{trace}\{\hat{P}_S \hat{P}_{S,(i)}\}\right]. \tag{10}$$

Computation of $r_S(W, F_n; x_i)$ for all $n$ observations requires estimation
at each of $F_n, F_{n,(1)}, \ldots, F_{n,(n)}$. Such a process can be extremely
inefficient when $n$ is large or $p$ is large (or both) due to the time it can
take to carry out an eigen-analysis $n + 1$ times. An approximation to
$r_S(W, F_n; x_i)$ may be computed by replacing unknown population parameters
in $\rho_S(W, F_{\mu,\Sigma}; x)$ with their respective estimates at $F_n$.
This approximate influence value is, from (5),

$$\hat{r}_S(W, F_n; x_i) = \frac{1}{K}\sum_{j \in S}\sum_{r \in S'} \frac{\left[\hat{v}_j^\top\, \mathrm{EIF}(W, F_n; x_i)\, \hat{v}_r\right]^2}{(\hat{\kappa}_j - \hat{\kappa}_r)^2} \tag{11}$$

where $\hat{\kappa}_j$ and $\hat{v}_j$ are the eigenvalue and eigenvector
estimates at $F_n$ and $\mathrm{EIF}(W, F_n; x_i)$ is the empirical influence
function consisting of estimates at $F_n$ in place of population parameter
values in $\mathrm{IF}(W, F_{\mu,\Sigma}; x)$.
When $\mathrm{IF}(W, F_{\mu,\Sigma}; x)$ exists in a closed form, i.e. in terms
of $x$ and population parameters only, then $\hat{r}_S(W, F_n; x_i)$ may be
calculated for each observation after just one eigen-analysis at $F_n$. In the
next section we will highlight the usefulness of this approximation in the
context of computation time.
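The difference in cost between the leave-one-out sample version and the single eigen-analysis approximation can be sketched for the classical covariance matrix, where the empirical influence function gives the summand $y_{ji}^2 y_{ri}^2/(\hat{\lambda}_j - \hat{\lambda}_r)^2$ with $y_{ji}$ the score of observation $i$ on sample eigenvector $j$. The code below is a minimal illustration on simulated data, not the R code used in the paper; the function names, the use of the divisor-$n$ covariance (to match the empirical distribution $F_n$) and the data are all assumptions.

```python
import numpy as np

def sorted_eigh(S):
    """Eigenvalues in descending order with matching eigenvectors of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(-vals)
    return vals[order], vecs[:, order]

def r_true(X, K):
    """Leave-one-out version: (n - 1)^2 * [1 - trace(P P_(i)) / K], n + 1 eigen-analyses."""
    n = X.shape[0]
    _, V = sorted_eigh(np.cov(X, rowvar=False, bias=True))
    P = V[:, :K] @ V[:, :K].T
    r = np.empty(n)
    for i in range(n):
        X_minus_i = np.delete(X, i, axis=0)
        _, Vi = sorted_eigh(np.cov(X_minus_i, rowvar=False, bias=True))
        r[i] = (n - 1) ** 2 * (1.0 - np.trace(P @ Vi[:, :K] @ Vi[:, :K].T) / K)
    return r

def r_approx(X, K):
    """Approximation for the covariance case from one eigen-analysis of the full sample."""
    lam, V = sorted_eigh(np.cov(X, rowvar=False, bias=True))
    Y2 = ((X - X.mean(axis=0)) @ V) ** 2           # Y2[i, j] = y_{ji}^2
    inv_gap2 = 1.0 / (lam[:K, None] - lam[None, K:]) ** 2
    return np.einsum("ij,jr,ir->i", Y2[:, :K], inv_gap2, Y2[:, K:]) / K

# Simulated data (hypothetical): n = 40 observations, p = 5 variables.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5)) * np.array([3.0, 2.0, 1.0, 0.7, 0.5])
r, r_hat = r_true(X, K=2), r_approx(X, K=2)
print(np.round(r[:5], 3), np.round(r_hat[:5], 3))
```

The loop in `r_true` is what becomes prohibitive when $p$ is large, since each pass repeats a full eigen-decomposition; `r_approx` reuses a single decomposition for all observations.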



5. Sample principal components of the classical covariance matrix estimator: A microarray application

In this section we consider the colon tumor microarray data set [1]. For
simplicity we consider the first $K$ estimated eigenvectors of the sample
covariance matrix estimate $S$. Computation of $r_S(W, F_n; x_i)$ for even just
a few observations can be extremely inefficient for high-dimensional data sets.
If $r_S(W, F_n; x_i)$ is to be calculated for all observations then such an
analysis may become extremely onerous. We will therefore consider efficiently
approximating $r_S(W, F_n; x_i)$ with $\hat{r}_S(W, F_n; x_i)$. All results
were obtained using R version 2.5.1 and the R function eigen for the
eigen-analysis, which utilizes LAPACK routines (see [2]) for computation. An
Intel Pentium D CPU 3.60GHz with 1.99GB of RAM was used for the analysis.

The colon tumor microarray data set consists of gene expression measurements
for 2000 genes corresponding to n = 62 samples. Each sample is classified as
either a 'normal tissue' sample or a 'tumor tissue' sample. Of the 40
individuals in the study, each has an associated 'tumor tissue' sample and 22 of
the individuals also have a 'normal tissue' sample. We consider the normalized
data where each sample is standardized to have mean 0 and standard deviation
1. Although this is a subset of a larger data set consisting of 6500 genes, most
statistical research has concentrated on just the 2000 genes. We chose this data
since it is often used in discriminant analysis but classical methods are not
immediately applicable due to the singularity of $S$. As such, methods such as
PCA may be used to initially reduce the dimension of the predictor space. For
more information regarding this data set see [1].

From (6),

$$\hat{r}_S(C_0, F_n; x_i) = \frac{1}{K}\sum_{j \in S}\sum_{r \in S'} \frac{y_{ji}^2 y_{ri}^2}{(\hat{\lambda}_j - \hat{\lambda}_r)^2} \tag{12}$$

where $y_{ji} = \hat{\eta}_j^\top (x_i - \bar{x})$, $\bar{x}$ is the sample
mean and $\hat{\eta}_1, \ldots, \hat{\eta}_p$ are the sample eigenvectors
corresponding to the sample eigenvalues
$\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \cdots \geq \hat{\lambda}_p$ of $S$.
Potential efficiency problems may still exist when the number of loop
repetitions is large. A total of $K \times (p - K)$ iterations are required for
the computation of a single $\hat{r}_S(C_0, F_n; x_i)$ so that the total number
of iterations required for the computation of
$\hat{r}_S(C_0, F_n; x_1), \ldots, \hat{r}_S(C_0, F_n; x_n)$ is
$n \times K \times (p - K)$. For example, if we consider n = 62, p = 2000 and
let K = 10, then the total number of iterations amounts to 1,233,800. However,
this can be greatly reduced when $p \gg n$ by noting that, since
$\mathrm{rank}(S) \leq n - 1$ giving $\hat{\eta}_k^\top S \hat{\eta}_k = 0$ for
$k > n - 1$, then $y_{ki} = 0$ for all $i = 1, \ldots, n$ when
$k = n, \ldots, p$. Hence

$$\hat{r}^*_S(C_0, F_n; x_i) = \frac{1}{K}\sum_{j \in S}\;\sum_{r \in S',\, r \leq n - 1} \frac{y_{ji}^2 y_{ri}^2}{(\hat{\lambda}_j - \hat{\lambda}_r)^2} \tag{13}$$

which requires just $K(n - 1 - K)$ iterations. Again for n = 62 and p = 2000 the
total number of iterations required for a choice of K = 10 is only 31,620; just
2.56% of the iterations required for (12).

Table 1
Computation time in seconds for computation of all 62 $r_S(W, F_n; x_i)$'s ($T_r$), $\hat{r}_S(W, F_n; x_i)$'s
($T_{\hat{r}}$) and $\hat{r}^*_S(W, F_n; x_i)$'s ($T_{\hat{r}^*}$) for the tumor data with S = {1, ..., K}. The Spearman rank
correlation between the $r_S(W, F_n; x_i)$'s and $\hat{r}_S(W, F_n; x_i)$'s, $\mathrm{SR}_S(r, \hat{r})$, is also reported.

S                          {1}        {1,2}      {1,2,3}    {1,...,4}   {1,...,5}
$T_r$                      2953.24    2951.47    2950.50    2952.33     3021.73
$T_{\hat{r}}$              30.26      31.56      32.78      33.91       36.27
$T_{\hat{r}^*}$            29.41      29.37      29.50      29.52       29.56
$\mathrm{SR}_S(r,\hat{r})$ 0.995      0.993      0.975      0.929       0.962

S                          {1,...,6}  {1,...,7}  {1,...,8}  {1,...,9}   {1,...,10}
$T_r$                      3044.42    2953.89    2957.11    2955.84     2959.18
$T_{\hat{r}}$              36.11      37.34      39.08      40.13       41.60
$T_{\hat{r}^*}$            29.56      29.60      29.70      29.67       29.70
$\mathrm{SR}_S(r,\hat{r})$ 0.963      0.958      0.954      0.958       0.967

S                          {1,...,11} {1,...,12} {1,...,13} {1,...,14}  {1,...,15}
$T_r$                      2957.14    2957.09    2962.31    2959.51     2958.94
$T_{\hat{r}}$              41.41      43.01      44.74      46.14       47.33
$T_{\hat{r}^*}$            29.78      29.74      29.75      29.78       29.77
$\mathrm{SR}_S(r,\hat{r})$ 0.958      0.960      0.940      0.904       0.913

S                          {1,...,16} {1,...,17} {1,...,18} {1,...,19}  {1,...,20}
$T_r$                      2959.51    2961.66    2960.95    2961.72     2962.49
$T_{\hat{r}}$              49.16      48.81      50.53      53.01       53.53
$T_{\hat{r}^*}$            29.81      29.86      29.84      29.88       29.92
$\mathrm{SR}_S(r,\hat{r})$ 0.875      0.747      0.814      0.777       0.656
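The saving from dropping the terms with zero scores when $p \gg n$ can be checked on a small simulated example. The sketch below compares the full double sum over all of $S'$ with the sum restricted to $r \leq n - 1$. It is an illustration only: the function name, the `r_stop` argument and the simulated data are assumptions, and the full $p \times p$ eigen-decomposition is used for both versions.

```python
import numpy as np

def r_hat_cov(X, K, r_stop=None):
    """Double sum over j in S = {1..K} and r in S'. With r_stop=None every r in S' is
    used; with r_stop = n - 1 only r <= n - 1 is used (the reduced sum)."""
    n, p = X.shape
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False, bias=True))
    order = np.argsort(-vals)
    stop = p if r_stop is None else r_stop
    lam, V = vals[order][:stop], vecs[:, order][:, :stop]
    Y2 = ((X - X.mean(axis=0)) @ V) ** 2           # Y2[i, j] = y_{ji}^2
    inv_gap2 = 1.0 / (lam[:K, None] - lam[None, K:]) ** 2
    return np.einsum("ij,jr,ir->i", Y2[:, :K], inv_gap2, Y2[:, K:]) / K

# Simulated data (hypothetical) with more variables than observations.
rng = np.random.default_rng(2)
n, p, K = 8, 30, 2
X = rng.standard_normal((n, p))
full = r_hat_cov(X, K)
reduced = r_hat_cov(X, K, r_stop=n - 1)
print(np.max(np.abs(full - reduced)))
```

The two versions agree up to rounding because the scores on eigenvectors beyond the $(n - 1)$th are numerically zero, so the discarded terms contribute nothing.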

In Table 1 we provide the time in seconds taken to compute all
$r_S(W, F_n; x_i)$'s and the approximations using $\hat{r}_S(W, F_n; x_i)$ and
$\hat{r}^*_S(W, F_n; x_i)$. To highlight how well the approximation can be used
to detect influential observations we also included the Spearman rank
correlations between the $r_S(W, F_n; x_i)$'s and $\hat{r}^*_S(W, F_n; x_i)$'s.
It is immediately evident that much time can be saved when using the
approximations. For example, for K = 2 the true computation cost is 2951.24
seconds compared to just 29.37 seconds (or around 1% of the time) using the
$\hat{r}^*_S(W, F_n; x_i)$'s. The high Spearman rank correlation of 0.993 also
indicates that $\hat{r}^*_S(W, F_n; x_i)$ is an excellent indicator of
influential observations.

Fig 2. Values of $r_S(W, F_n; x_i)$'s (solid line) and $\hat{r}^*_S(W, F_n; x_i)$'s (dashed line) for the tumor
data with (a) S = {1}, (b) S = {1, 2, 3}, (c) S = {1, ..., 6} and (d) S = {1, ..., 12}.

High correlations over 0.9 are maintained for all choices of K up to and
including K = 15. It is also clear that calculating the approximation based on
the fewer loop iterations given in (13) is beneficial, with all computations
coming in under 30 seconds compared to computation times approaching 60
seconds for larger choices of K when (12) was utilized.

Further evidence of how close the approximation is to the true sample
influence measure is provided in Figure 2. Here we plot the
$r_S(W, F_n; x_i)$'s versus the $\hat{r}^*_S(W, F_n; x_i)$'s for all
observations and some choices of K. For K = 1 and 3 there is little difference
between the true and approximate values. For K = 6 and 12, although there are
obvious differences for some observations, the approximation is still highly
successful in highlighting influential observations.

6. Discussion 

In this paper we considered sensitivity analysis of subsets of eigenvectors
based on perturbation of the GCD and RV measures. Examples were provided in
Section 3 to show how this analysis complements existing influence studies.
In Section 5 we highlighted how an approximate sample version of the measure
can be used to efficiently detect influential observations in practice. The
data set considered consisted of 2000 measurement variables for 62 samples so
that a leave-one-out sensitivity analysis requiring repeated principal component
estimation was computationally inefficient. However, the approximate version
provided an excellent approximation to the true sample measure when the subset
of eigenvectors was not too large. This approximate version was much less
time consuming to compute, therefore offering a useful means to assess influence
for large data sets.



Appendix A: Proof of Theorem 1 

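Using well-known eigen perturbation theory (see, e.g., [26]), it is
straightforward to show that $P_S(\varepsilon)$ can be represented by the
convergent power series

$$P_S(\varepsilon) = P_S + \varepsilon P_1 + \frac{1}{2}\varepsilon^2 P_2 + O(\varepsilon^3). \tag{14}$$

For $S$ of the form $\{1, \ldots, K\}$, [31] give the form of $P_1$ and $P_2$.
It is, however, a simple generalization to use $S \subset \{1, \ldots, p\}$ and
$S'$ in place of $\{1, \ldots, K\}$ and $\{K + 1, \ldots, p\}$ respectively and
simply let $K$ equal the number of elements in $S$.

Note that, since $(I - P_S)$ is a projection matrix,

$$\begin{aligned}
K - \mathrm{trace}[P_S P_S(\varepsilon)] &= \mathrm{trace}[(I - P_S)P_S(\varepsilon)] \\
&= \mathrm{trace}[(I - P_S)P_S(\varepsilon)(I - P_S)] \\
&= \varepsilon\,\mathrm{trace}\left[(I - P_S)\left(P_1 + \frac{1}{2}\varepsilon P_2\right)(I - P_S)\right] + O(\varepsilon^3)
\end{aligned}$$

from (14) and since $(I - P_S)P_S = 0$.

The proof is complete by noting that, from [31], $(I - P_S)P_1(I - P_S) = 0$ and,
with $W_1 = \mathrm{IF}(W, F_{\mu,\Sigma}; x)$,

$$\begin{aligned}
\mathrm{trace}[(I - P_S)P_2(I - P_S)] &= \mathrm{trace}\left[2\sum_{r \in S'}\sum_{t \in S'}\sum_{j \in S} \frac{(v_j^\top W_1 v_r)(v_j^\top W_1 v_t)}{(\kappa_j - \kappa_r)(\kappa_j - \kappa_t)}\, v_r v_t^\top\right] \\
&= 2\sum_{j \in S}\sum_{r \in S'} \frac{\left[v_j^\top W_1 v_r\right]^2}{(\kappa_j - \kappa_r)^2}
\end{aligned}$$

since $v_r^\top v_t = 0$ $(r \neq t)$ and $v_r^\top v_r = 1$.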